In recent years, machine learning has attained higher precision and accuracy 
in clinical heart disease dataset classification. However, the literature shows 
that the quality of the heart disease features used to train a model has a 
significant impact on the outcome of the predictive model. Thus, this study 
explores the impact of the quality of heart disease features on the 
performance of machine learning models for heart disease prediction 
by employing recursive feature elimination with cross-validation (RFECV). 
Furthermore, the study identifies the heart disease features with a significant 
effect on model output. The dataset for experimentation is obtained from the 
University of California Irvine (UCI) machine learning repository. The 
experiment is implemented using a support vector machine (SVM), logistic 
regression (LR), a decision tree (DT), and a random forest (RF). 
The performance of the SVM, LR, DT, and RF models is evaluated using 
accuracy, area under the curve (AUC), and average precision. The results 
indicate that the quality of the features significantly affects the performance 
of the model. Overall, RF outperforms the other algorithms, achieving a 
predictive accuracy of 99.7%. 
1. INTRODUCTION 

In the last few years, the implementation and adoption of machine learning algorithms for heart 
disease diagnosis have been a major focus of researchers [1]. The reasons behind the wide adoption and 
application of machine learning and predictive models for heart disease prediction include the promising 
accuracy of the learning model compared to a human expert, its speed, and the cost of 
heart disease prediction or detection. Despite the wide adoption of predictive models for 
heart disease diagnosis and their promising results, the performance of machine 
learning models still has scope for improvement. The literature has largely focused on the impact of heart 
disease feature quality on the learning model. Hence, this study aims to further investigate the impact of 
heart disease feature quality and to identify the most important or informative heart disease features that 
represent a heart disease patient, resulting in better predictive outcomes. 

This research focuses on the application of the recursive feature elimination with cross-validation 
(RFECV) method to process heart disease data before the model is trained using a support vector machine 
(SVM), logistic regression (LR), decision tree (DT), and random forest (RF). The RFECV method is used to 
determine the most relevant heart disease risk factor features, which are important for improving the prediction 
outcome of SVM, LR, DT, and RF. Generally, the goal of this research is to investigate the number of heart 
disease features required to develop a more accurate and computationally efficient model for heart disease 
prediction. In addition, the variability of the performance of the SVM, LR, DT, and RF models for heart disease 
prediction is explored using the implemented RFECV method. This research follows an empirical 
methodology, using RFECV for feature selection and SVM, LR, DT, and RF for model 
development on real-world data obtained from the University of California Irvine (UCI) heart disease data 
repository. The objectives of this research are to answer the following questions: i) what is the optimal 
number of heart disease features that maximizes the performance of the SVM, DT, RF, and LR models for heart 
disease prediction?; ii) what is the impact of varying the number of features on the performance of the SVM, 
DT, RF, and LR models for heart disease prediction?; and iii) among SVM, LR, DT, and RF, which predictive 
model has high variability of cross-validation score for a varying number of heart disease features? 


2. LITERATURE REVIEW 

Heart disease is a cardiovascular disease that causes death all over the world [2]. The identification 
of heart disease is difficult using common heart disease risk factors such as high blood pressure, high 
cholesterol level, age, and sex. Overall, the characteristics of heart disease are complex, 
and some heart disease features overlap with those of other diseases such as chronic kidney disease. Thus, the 
identification of heart disease requires caution, and heart disease treatment requires highly experienced 
cardiologists, which is usually costly and requires much time and human effort. 

Recently, machine learning has been gaining importance in the health care industry as one of the means to 
combat the long-term effect of heart disease on society [3]-[5]. Higher precision, high performance, and 
cost-effectiveness are the major advantages of predictive models in heart disease identification. With more 
and more patients admitted to hospitals, the diagnosis of heart disease is becoming more challenging. One of 
the major challenges in heart disease identification is that highly experienced cardiologists are required to 
identify heart disease accurately. However, training humans requires much effort and time: it usually takes many 
years for healthcare clinicians to gain the necessary skill and experience in heart disease identification. Thus, 
machine learning has become not only an alternative to human experts in heart disease 
identification, but also a necessity to aid the decision-making process during heart disease identification. In 
this study, the authors developed a predictive model for heart disease diagnosis using supervised learning 
algorithms, specifically SVM, LR, DT, and RF. The authors also evaluated the developed models 
using the heart disease dataset collected from the UCI data repository. 

In [6], the researchers evaluated the performance of DT, RF, and an artificial neural network (ANN) for 
heart disease diagnosis using the UCI heart disease data repository. The experiments on DT, RF, and 
ANN show that the ANN outperforms the DT and RF models for heart disease 
detection: predictive performance of 85.03%, 79.93%, and 79.93% is achieved with the ANN, DT, and RF 
models, respectively. Moreover, in another study [7], the researchers applied RF to develop 
a predictive model for heart disease. The authors evaluated the model on a heart 
disease test set, and the results show that the model achieved an accuracy of 94.03%. 

Despite the wide application of supervised machine learning algorithms to heart disease datasets for 
implementing automated intelligent models for heart disease prediction [8]-[10], the literature shows that 
the heart disease features used have an impact on the prediction performance of the model, and that 
feature selection also reduces computational cost such as time and memory space. Thus, the heart disease 
symptoms or features that are important to represent a heart disease sample have to be determined to improve 
the performance of the machine learning model for heart disease prediction. This research therefore focuses on 
automating heart disease diagnosis with the RFECV method to obtain better results on 
heart disease prediction using SVM, LR, DT, and RF models. 


3. METHOD 

To conduct this study, the authors reviewed recently published articles in reputed international 
scholarly journals indexed in Scopus. We then collected clinical heart disease records and conducted 
exploratory data analysis using statistical methods such as correlation analysis and descriptive statistics. 
Much of the research work in the literature [11]-[16] has employed RF, SVM, LR, and DT for 
heart disease diagnosis. Based on the literature survey, we selected the four most popular 
supervised machine learning algorithms, namely SVM, DT, RF, and LR, to conduct experimental research on 
their performance. The heart disease dataset is collected from the UCI machine 
learning data repository, which is one of the most popular data repositories for conducting 
experimental research in machine learning [17]-[21]. The characteristics of the heart disease dataset 
employed in this study are summarized in Table 1. 


Table 1. Heart disease dataset characteristics 
Data source          No. of instances  No. of patients  No. of non-patients  No. of classes 
UCI data repository  1025              526              498                  2 (patient and healthy) 


3.1. UCI heart disease feature description 

The UCI heart disease dataset consists of 1,025 sample data points, and each data point or sample is 
described by the 13 heart disease features listed in Table 2. The authors used 70% of the dataset, 
or 717 samples, for training, and the remaining 30%, or 308 samples, for testing. In addition, the dataset 
has a balanced distribution of observations across the patient and non-patient classes. 
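
The 70/30 partition described above can be reproduced with a few lines of Python. This is a minimal sketch, assuming a plain shuffled split with the test share rounded up; the study does not state how rows were assigned, whether the split was stratified, or which random seed was used, so `split_indices`, `test_fraction`, and `seed` are hypothetical names chosen for illustration.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    # Assumption: the test share is rounded up to a whole number of samples.
    n_test = math.ceil(test_fraction * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    return indices[n_test:], indices[:n_test]


# 1,025 rows, as reported for the UCI heart disease dataset.
train_idx, test_idx = split_indices(1025)
print(len(train_idx), len(test_idx))
```

With 1,025 rows this yields 717 training and 308 test indices, matching the sample counts reported in the text.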


Table 2. Heart disease dataset description 

Feature                            Data type  Description                                         Value 
Age                                Numeric    Age of patient                                      Mean=54, Max=77, Min=29 
Sex                                Nominal    Patient’s gender                                    1=Male, 0=Female 
restecg                            Nominal    Resting electrocardiographic result                 0=Normal, 1=Having ST-T, 2=Showing probability 
Cholesterol (chol)                 Numeric    Continuous value in mg/dl                           Mean=246, Max=564, Min=126 
Fasting blood sugar (fbs)          Nominal    Level of sugar in blood >120 mg/dl                  1=Yes, 0=No 
Heart rate achieved (thalach)      Numeric    Maximum heart rate achieved                         Mean=149, Max=202, Min=71 
Thallium scan (thal)               Nominal    Thallium scan result                                3=Normal, 6=Fixed defect, 7=Reversible defect 
Exercise induced angina (exang)    Nominal    Presence of exercise-induced angina                 1=Yes, 0=No 
Slope (slope)                      Nominal    Slope of the peak exercise ST segment               1=Upsloping, 2=Flat, 3=Downsloping 
Status of fluoroscopy (ca)         Nominal    Number of vessels colored through X-ray             0 to 3 
Chest pain (cp)                    Nominal    Chest pain type                                     1=Typical, 2=Atypical, 3=Non-angina, 4=Asymptomatic 
oldpeak                            Numeric    ST depression induced by exercise relative to rest  Mean=1.07, Max=6.2, Min=0 
Resting blood pressure (trestbps)  Numeric    Blood pressure at rest in mmHg                      Mean=131, Max=200, Min=94 
Target                             Nominal    Predicted class                                     1=Patient, 0=Healthy (not a patient) 


3.2. Correlation model 

To gain insight into the heart disease dataset and explore the dependency or collinearity that exists 
among heart disease features, we applied Pearson correlation to the heart disease dataset for 
exploratory data analysis. Figure 1 shows the correlation circle for heart disease features. The Pearson 
correlation between each pair of heart disease features is determined using Pearson’s correlation formula given 
in (1) [22]-[24]: 


$$ r = \frac{\sum_{i}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i}(x_i - \bar{x})^2 \, \sum_{i}(y_i - \bar{y})^2}} \qquad (1) $$


where $r$ denotes the Pearson correlation coefficient, $x_i$ denotes the values of the variable x in the heart disease 
dataset, $y_i$ denotes the values of the variable y in the heart disease dataset, $\bar{x}$ denotes the mean of the 
values of variable x, and $\bar{y}$ denotes the mean of the values of variable y in the heart disease dataset. 
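
The coefficient in (1) can be computed directly from its definition. The following is a minimal sketch using only the Python standard library; the function name `pearson_r` and the sample values are hypothetical and are not taken from the dataset.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Numerator: sum of products of deviations from the means.
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    denominator = math.sqrt(ss_x * ss_y)
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return numerator / denominator


# Illustrative values only (not records from the UCI dataset).
age = [29, 41, 54, 63, 77]
resting_bp = [110, 120, 131, 140, 150]
print(round(pearson_r(age, resting_bp), 3))
```

Applying the function to every pair of feature columns gives the correlation matrix that a plot such as Figure 1 visualizes.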

Figure 1 shows the correlation of heart disease dataset features. Resting blood pressure, age, 
fasting blood sugar, cholesterol, and the number of main vessels colored by fluoroscopy are positively 
correlated with each other. In addition, sex, heart rate, exercise-induced angina, and oldpeak are positively 
correlated with each other. Similarly, chest pain, slope, and maximum heart rate achieved are positively 
correlated with each other. In contrast, chest pain, maximum heart rate, and slope are negatively correlated with 
sex, heart rate, exercise-induced angina, and oldpeak. Similarly, resting electrocardiography has a 
negative correlation with resting blood pressure, age, fasting blood sugar, cholesterol, and the number of main 
vessels colored by fluoroscopy. 
Figure 1. Correlation graph for heart disease features 


3.3. RFECV 

Recursive feature elimination (RFE) is a feature selection technique that fits a model and 
eliminates the least important features until the specified number of features is reached. Features are 
ranked by their importance, and by recursively removing a small number of features per loop, 
RFE attempts to eliminate dependencies or collinearity that exist between the features [25]. To determine the relevant 
number of heart disease features, we employed RFE with cross-validation (RFECV) to compute the 
cross-validation score on each selected subset of heart disease features. 


4. RESULTS AND DISCUSSION 

This section presents the results, including the accuracy and cross-validation variability of SVM, 
LR, DT, and RF for a varying number of features. The performance of SVM, DT, RF, and LR is evaluated using 
accuracy, area under the receiver operating characteristic curve (AUC), and average precision. In the 
experiments, each model is tested against a varying number of input features. Cross-validated accuracy is 
computed for different input feature sizes and the results are compared. 
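
The three evaluation metrics named above can be written out from their definitions. The sketch below uses only the Python standard library and assumes binary labels (1 = patient, 0 = healthy) and no tied scores in the average precision; the function names and the toy scores are hypothetical and are not results from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("at least one positive sample is required")
    return total / hits


# Toy example (illustrative values only).
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption

print(accuracy(y_true, y_pred), roc_auc(y_true, scores), average_precision(y_true, scores))
```

Accuracy depends on the chosen decision threshold, whereas AUC and average precision are computed from the ranking of the scores alone.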


4.1. The effect of heart disease input feature size on classifier performance 

To determine the optimal number of heart disease features, cross-validation is used with RFE to 
score different feature subsets and select the best-performing collection of heart disease features. The RFECV 
results are shown in Figures 2(a) and 2(b) and Figures 3(a) and 3(b), which plot the number of heart disease features 
in the SVM, RF, DT, and LR models against their cross-validated test score and its variability. 
Moreover, Figures 2(a) and 2(b) and Figures 3(a) and 3(b) indicate the selected number of heart disease 
features for each model. 

Figure 2 indicates that the performance of the SVM and RF models is highly affected by the heart 
disease features used for model training. Figures 2(a) and 2(b) and Figures 3(a) and 3(b) show the RFECV 
curves of the proposed models. The SVM achieves its highest accuracy when 11 informative heart disease 
features are used, as shown in Figure 2(a); the accuracy then gradually decreases as non-informative 
heart disease features are added to the model. Similarly, RF achieves its highest accuracy when 11 
informative heart disease features are used, as shown in Figure 2(b). In addition, the shaded region represents 
the variability of cross-validation, that is, the standard deviation above and below the mean accuracy score drawn by 
the cross-validation curve. We see from Figures 2(a) and 2(b) that the cross-validation 
scores vary more with the number of heart disease features for SVM than for the RF model on heart 
disease prediction. 
Figure 2. Effects of selecting heart disease features on the performance of the classifier (a) SVM and (b) RF 
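
The mean curve and the shaded variability band in such a plot can be derived from the per-fold scores for each feature count. The sketch below assumes the band is the population standard deviation across folds, which the study does not specify; the fold accuracies are hypothetical numbers chosen for illustration and are not results from the experiments.

```python
from statistics import mean, pstdev

# Hypothetical fold accuracies keyed by number of features (illustrative only).
fold_scores = {
    9: [0.80, 0.84, 0.82, 0.78, 0.81],
    10: [0.83, 0.85, 0.84, 0.82, 0.86],
    11: [0.86, 0.88, 0.85, 0.87, 0.84],
    12: [0.85, 0.86, 0.84, 0.85, 0.83],
    13: [0.84, 0.85, 0.83, 0.84, 0.82],
}

# Mean score (the curve) and standard deviation (the shaded band) per feature count.
summary = {k: (mean(v), pstdev(v)) for k, v in fold_scores.items()}

# The selected feature count is the one with the highest mean cross-validated score.
best_k = max(summary, key=lambda k: summary[k][0])

for k, (m, s) in summary.items():
    print(f"{k} features: mean={m:.3f}, band=[{m - s:.3f}, {m + s:.3f}]")
print("selected number of features:", best_k)
```

A wider band at a given feature count means the fold scores disagree more, which is the sense in which one model is said to have higher cross-validation variability than another.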


We see from Figures 3(a) and 3(b) that DT achieves its highest accuracy when 8 informative heart 
disease features are used, as shown in Figure 3(a); the accuracy then remains constant as non-informative 
heart disease features are added to the model. Similarly, LR achieves its highest accuracy when 10 
informative heart disease features are used, as shown in Figure 3(b). In addition, Figures 3(a) and 
3(b) show that the cross-validation scores vary more with the number of heart disease features for 
LR than for the DT model on heart disease prediction. 
Figure 3. Effects of selecting heart disease features on the performance of the classifier (a) DT and (b) LR 


4.2. Comparison between the accuracy of the model 

The authors employed accuracy as a performance metric to evaluate and compare the performance 
of the SVM, DT, LR, and RF models. The highest accuracy achieved differs from model to model. 
In addition to the variation in accuracy across models, the experimental results indicate that 
model performance varies with the features used. Table 3 shows the variation in model 
performance for a varying number of input features. 


Table 3. Model performance with the selected number of heart disease features 
Model  No. of features  Highest accuracy  Avg. AUC  Avg. precision 
LR     10               85.8%             0.88      0.94 
SVM    11               85.2%             0.84      0.86 
DT     8                99.6%             0.96      1.00 
RF     11               99.7%             1.00      1.00 
5. CONCLUSION 

In this study, the authors conducted an empirical study on the performance of machine learning for 
heart disease prediction using SVM, LR, DT, and RF. Furthermore, we employed RFECV to select 
the optimal input features to obtain better heart disease diagnosis outcomes. With RFECV, we determined 
the optimal number of heart disease features that maximizes the heart disease diagnosis outcome of the 
proposed models. In addition, the proposed models are compared, and the experimental 
results show that the RF model outperforms DT, SVM, and LR. Overall, the RF, DT, SVM, and LR 
models achieved classification accuracies of 99.7%, 99.6%, 85.2%, and 85.8%, respectively. 